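This post covers an AAAI 2019 paper on sentence semantic matching that proposes the Dynamic Re-read Network (DRr-Net). The core idea is to compute attention several times, with each pass picking out the most important word, which yields a dynamic representation of the sentence. The static and dynamic representations of each sentence are then combined to score how well the two sentences match.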
Introduction
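The paper's central idea is that, in semantic matching, the important parts of a sentence change dynamically, so the sentence should be read and used several times.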
For example, when judging the relation between “a person with a purple shirt is painting an image of a woman on a white wall” and “a woman paints a portrait of her best friend”, the important words will change from “person, purple, shirt, painting, image, woman” to “person, image, woman” in the first sentence, and from “woman, paints, portrait, best friend” to “woman, portrait, best friend” in the second sentence. As the Chinese proverb says: “The gist of an article will come to you after reading it over 100 times”.
Dynamic Re-read Network
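Problem definition: given two sentences $a$ and $b$ of lengths $l_{a}$ and $l_{b}$, the goal is to learn a classifier that predicts the relation between them. The paper evaluates on natural language inference (NLI), a three-way classification over Entailment, Contradiction, and Neutral, and on duplicate-question detection, which is a binary classification.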
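Overall model structure: the model has three components, Input Embedding, the Dynamic Re-read Mechanism, and Label Prediction.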
Input Embedding
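This part consists of the Word Embedding layer and the Attention Stack-GRU (ASG) unit.
Word Embedding: each word is represented by concatenating its pre-trained word vector, character features, and syntactic features, which gives the sentence sequence representations $\{a_{i}|i=1,2,\dots,l_{a}\}$ and $\{b_{j}|j=1,2,\dots,l_{b}\}$.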
The character features are obtained by applying a convolutional neural network with a max pooling layer to the learned character embeddings, which can represent words in a finer-granularity and help to avoid the Out-Of-Vocabulary (OOV) problem that pre-trained word vectors suffer from. The syntactical features consist of the embedding of part-of-speech tagging feature, binary exact match feature, and binary antonym feature, which have been proved useful for sentence semantic understanding (Chen et al. 2017a; Gururangan et al. 2018).
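Attention Stack-GRU (ASG): the sentence sequence representation is then passed through a stacked GRU.
$H_{l}$ denotes the $l$-th GRU layer. Concatenating the outputs of all layers gives the final hidden states $\{h_{i}^{a}|i=1,2,\dots,l_{a}\}$ and $\{h_{j}^{b}|j=1,2,\dots,l_{b}\}$.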
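The stacked GRU with concatenated layer outputs can be sketched in NumPy as follows. This is only an illustration under assumptions: the `GRUCell` class, the random weights, the dimensions, and the choice that each layer reads the previous layer's output are all hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random weights (illustration only, no biases)."""
    def __init__(self, input_dim, hidden_dim, rng):
        shape = (hidden_dim, input_dim + hidden_dim)
        self.W_z = rng.normal(0, 0.1, shape)  # update gate weights
        self.W_r = rng.normal(0, 0.1, shape)  # reset gate weights
        self.W_c = rng.normal(0, 0.1, shape)  # candidate state weights
        self.hidden_dim = hidden_dim

    def step(self, x, h_prev):
        xh = np.concatenate([x, h_prev])
        z = sigmoid(self.W_z @ xh)
        r = sigmoid(self.W_r @ xh)
        c = np.tanh(self.W_c @ np.concatenate([x, r * h_prev]))
        return (1.0 - z) * h_prev + z * c

def stacked_gru(inputs, cells):
    """Run inputs (seq_len, input_dim) through every layer in `cells` and
    return, for each word, the concatenation of all layers' outputs."""
    layer_input = inputs
    outputs = []
    for cell in cells:
        h = np.zeros(cell.hidden_dim)
        states = []
        for x in layer_input:
            h = cell.step(x, h)
            states.append(h)
        layer_output = np.stack(states)
        outputs.append(layer_output)
        layer_input = layer_output  # assumption: next layer reads this output
    return np.concatenate(outputs, axis=1)

rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_layers, seq_len = 8, 4, 3, 5
cells = [GRUCell(embed_dim if l == 0 else hidden_dim, hidden_dim, rng)
         for l in range(num_layers)]
a = rng.normal(size=(seq_len, embed_dim))  # stand-in for the embedded sentence
H_a = stacked_gru(a, cells)
print(H_a.shape)  # (5, 12): seq_len x (num_layers * hidden_dim)
```

Concatenating every layer keeps both the shallow and the deep features of each word available to the attention step that follows.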
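Next, an attention mechanism pools these hidden states into a single vector $h^{a}$ that represents the whole sentence.
$h^{b}$ is obtained in the same way.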
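A minimal sketch of attention pooling over the hidden states, assuming the common additive form (a `tanh` projection followed by a scoring vector and a softmax). The parameter names `W`, `b`, `w` and all dimensions are assumptions for illustration; the paper's exact parameterization may differ.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(H, W, b, w):
    """H: (seq_len, d) hidden states. Returns (sentence_vector, weights)."""
    scores = np.tanh(H @ W.T + b) @ w  # one scalar score per word
    weights = softmax(scores)          # normalise over the sentence
    return weights @ H, weights        # weighted sum of the hidden states

rng = np.random.default_rng(0)
seq_len, d, d_att = 5, 12, 6
H_a = rng.normal(size=(seq_len, d))    # stand-in for the ASG outputs
W = rng.normal(size=(d_att, d))
b = np.zeros(d_att)
w = rng.normal(size=d_att)

h_a, alpha = attention_pool(H_a, W, b, w)
print(h_a.shape, alpha.round(3))
```

The same function applied to the hidden states of the second sentence would give $h^{b}$.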
Dynamic Re-read Mechanism
Moreover, with an in-depth understanding of the sentence, the important words that should be concerned are dynamically changing, even the words that did not get attention before.
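As shown in Figure 1(C), a GRU encodes the most important word selected on each read, which yields the dynamic representations $v^{a}$ and $v^{b}$.
$T$ is the number of dynamic reads. The selection function $F$ is computed with an attention mechanism that scores every word and returns the one at the index of the highest score.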
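The re-read loop can be sketched as below, using hard selection of the top-scoring word at each step. Conditioning the scores on the previous read state and on the other sentence's vector is an assumption made here for illustration, as are the names `select_word`, `re_read`, the parameter tuples, and all dimensions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU update (no biases) with weights acting on [x; h_prev]."""
    W_z, W_r, W_c = params
    xh = np.concatenate([x, h_prev])
    z = sigmoid(W_z @ xh)
    r = sigmoid(W_r @ xh)
    c = np.tanh(W_c @ np.concatenate([x, r * h_prev]))
    return (1.0 - z) * h_prev + z * c

def select_word(H, h_prev, h_other, att):
    """F: score every word and return the index of the highest score.
    Assumption: scores depend on the previous read state and the other sentence."""
    W, U, M, w = att
    context = U @ h_prev + M @ h_other       # shared by all words
    scores = np.tanh(H @ W.T + context) @ w  # (seq_len,)
    return int(np.argmax(scores))

def re_read(H, h_other, att, gru_params, T, state_dim):
    """Read the sentence T times, encoding one selected word per read."""
    h = np.zeros(state_dim)
    chosen = []
    for _ in range(T):
        j = select_word(H, h, h_other, att)
        chosen.append(j)
        h = gru_step(H[j], h, gru_params)
    return h, chosen

rng = np.random.default_rng(0)
seq_len, d, d_att, state_dim, T = 6, 12, 8, 10, 5
H_a = rng.normal(size=(seq_len, d))  # hidden states of sentence a
h_b = rng.normal(size=d)             # pooled vector of sentence b
att = (rng.normal(size=(d_att, d)), rng.normal(size=(d_att, state_dim)),
       rng.normal(size=(d_att, d)), rng.normal(size=d_att))
gru_params = tuple(rng.normal(0, 0.1, (state_dim, d + state_dim)) for _ in range(3))

v_a, chosen = re_read(H_a, h_b, att, gru_params, T, state_dim)
print(v_a.shape, chosen)
```

Because the scores depend on the running state, the selected index can change from one read to the next, which is the dynamic behaviour the section describes.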
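Note that the index() operation is not differentiable, so the authors approximate it with a softmax function.
$\beta$ is an arbitrarily large value, chosen so that the weight of the most important word tends to 1 while the weights of the other words tend to 0.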
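A small numeric illustration of how scaling the scores by a large $\beta$ turns a softmax into a near one-hot selection. The scores and the toy hidden states are made up for the example.

```python
import numpy as np

def sharpened_softmax(scores, beta):
    """Softmax over beta * scores; large beta pushes the result towards one-hot."""
    z = beta * scores
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([0.2, 1.5, 0.9, 1.4])         # made-up attention scores
H = np.arange(12, dtype=float).reshape(4, 3)    # toy hidden states, one row per word

for beta in (1.0, 10.0, 1000.0):
    weights = sharpened_softmax(scores, beta)
    soft_pick = weights @ H                     # differentiable stand-in for H[argmax]
    print(beta, np.round(weights, 3), np.round(soft_pick, 3))

hard_pick = H[np.argmax(scores)]                # the non-differentiable selection
```

As `beta` grows, `soft_pick` approaches `hard_pick`, while the weighted sum stays differentiable with respect to the scores.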
Label Prediction
The static representations $h^{a}, h^{b}$ and the dynamic representations $v^{a}, v^{b}$ are matched separately, giving $p^{h}$ and $p^{v}$, which denote the probability distribution over the classes from the original sentence representations and from the dynamic sentence representations, respectively.
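One way to picture the two matching branches is the sketch below. The matching features (both vectors, their element-wise product, and their difference), the separate one-hidden-layer MLPs, and every dimension are assumptions chosen for illustration. The sketch stops at $p^{h}$ and $p^{v}$ and does not model how they are fused into the final prediction.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def match_features(u, v):
    # Assumed matching heuristic: both vectors, their product and difference.
    return np.concatenate([u, v, u * v, v - u])

def mlp_classify(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, W1 @ x + b1)  # ReLU hidden layer
    return softmax(W2 @ hidden + b2)       # class distribution

def init_mlp(in_dim, hidden_dim, num_classes, rng):
    return (rng.normal(0, 0.1, (hidden_dim, in_dim)), np.zeros(hidden_dim),
            rng.normal(0, 0.1, (num_classes, hidden_dim)), np.zeros(num_classes))

rng = np.random.default_rng(0)
d_static, d_dynamic, num_classes = 12, 10, 3
h_a, h_b = rng.normal(size=d_static), rng.normal(size=d_static)    # static vectors
v_a, v_b = rng.normal(size=d_dynamic), rng.normal(size=d_dynamic)  # dynamic vectors

p_h = mlp_classify(match_features(h_a, h_b),
                   *init_mlp(4 * d_static, 16, num_classes, rng))
p_v = mlp_classify(match_features(v_a, v_b),
                   *init_mlp(4 * d_dynamic, 16, num_classes, rng))
print(p_h.round(3), p_v.round(3))
```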
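The model is trained with a cross-entropy loss on the final prediction.
To add supervision, cross-entropy losses are also applied to the two distributions $p^{h}$ and $p^{v}$, and an L2 regularization term is added at the end.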
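The combined objective can be sketched as three cross-entropy terms plus an L2 penalty. Weighting the three terms equally, using the squared L2 norm, and the coefficient `l2_weight` are assumptions for illustration. The probabilities below are made-up numbers.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood. probs: (N, C); labels: (N,) class indices."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def total_loss(p, p_h, p_v, labels, params, l2_weight=1e-4):
    """Cross-entropy on the final, static, and dynamic predictions plus L2."""
    l2 = sum(float(np.sum(w ** 2)) for w in params)
    return (cross_entropy(p, labels)
            + cross_entropy(p_h, labels)
            + cross_entropy(p_v, labels)
            + l2_weight * l2)

labels = np.array([0, 2])
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])    # final prediction
p_h = np.array([[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])  # static branch
p_v = np.array([[0.5, 0.3, 0.2], [0.3, 0.2, 0.5]])  # dynamic branch
params = [np.ones((2, 2))]                          # stand-in for model weights

loss = total_loss(p, p_h, p_v, labels, params)
print(loss)
```

The two auxiliary terms push each branch to be predictive on its own, so neither representation can rely entirely on the other.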
Experiment
The authors run experiments on three public datasets:
SNLI: The SNLI (Bowman et al. 2015) contains 570,152 human annotated sentence pairs. Each sentence pair is labeled with one of the following relations: Entailment, Contradiction, or Neutral.
SICK: The SICK (Marelli et al. 2014) contains 10,000 sentence pairs. The labels are the same as in the SNLI dataset.
Quora: The Quora Question Pair (Iyer, Dandekar, and Csernai 2017) dataset consists of over 400,000 potential question duplicate pairs. Each pair has a binary value that indicates whether the line truly contains a duplicate pair.
When the re-read length is between 5 and 7, DRr-Net achieves the best performance. This phenomenon is consistent with the psychological findings that human attention focuses on nearly 7 words (Tononi 2008).
Conclusion and Future Work
In this paper, we proposed a Dynamic Re-read Network (DRr-Net) approach for sentence semantic matching, a novel architecture that was able to pay close attention to a small region of sentences at each time and re-read the important information for better sentence semantic matching.
In the future, we will focus on providing more information for the attention mechanism to select the important parts more precisely and to reduce repeated reading of the same word.